@hpatro
Collaborator

@hpatro hpatro commented Apr 2, 2025

New fields in CLUSTER INFO:

  • cluster_nodes_pfail
  • cluster_nodes_fail
  • cluster_voting_nodes_pfail
  • cluster_voting_nodes_fail

I'm running a few tests and trying to capture the number of partially failed (pfail) and completely failed (fail) nodes. Slot-level pfail / fail stats already exist, but it is difficult to assess the node failure count from them.

New output:

> CLUSTER INFO
cluster_state:fail
cluster_slots_assigned:0
cluster_slots_ok:0
cluster_slots_pfail:0
cluster_slots_fail:0
cluster_nodes_pfail:1
cluster_nodes_fail:0
cluster_voting_nodes_pfail:1
cluster_voting_nodes_fail:0
cluster_known_nodes:3
cluster_size:0
cluster_current_epoch:1
cluster_my_epoch:1
cluster_stats_messages_ping_sent:2104
cluster_stats_messages_pong_sent:1906
cluster_stats_messages_meet_sent:1
cluster_stats_messages_sent:4011
cluster_stats_messages_ping_received:1906
cluster_stats_messages_pong_received:1964
cluster_stats_messages_received:3870
total_cluster_links_buffer_limit_exceeded:0
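
For scripts that consume this output, the new counters can be picked out with a small parser. The following is a minimal sketch in Python, assuming the reply is available as text in the `field:value` format shown above. The names `parse_cluster_info` and `node_failure_counts` are illustrative and are not part of Valkey or any client library.

```python
from typing import Dict, Optional

# The four fields added by this change.
NODE_FAILURE_FIELDS = (
    "cluster_nodes_pfail",
    "cluster_nodes_fail",
    "cluster_voting_nodes_pfail",
    "cluster_voting_nodes_fail",
)


def parse_cluster_info(reply: str) -> Dict[str, str]:
    """Parse a CLUSTER INFO reply (one `field:value` pair per line) into a dict."""
    fields: Dict[str, str] = {}
    for line in reply.splitlines():
        line = line.strip()
        # Skip blank lines and anything that is not a field, such as a CLI prompt.
        if not line or ":" not in line:
            continue
        key, _, value = line.partition(":")
        fields[key] = value
    return fields


def node_failure_counts(reply: str) -> Dict[str, Optional[int]]:
    """Return the node failure counters, or None for fields the server did not report."""
    info = parse_cluster_info(reply)
    # A server without this change omits the fields; report None instead of guessing 0.
    return {
        name: int(info[name]) if name in info else None
        for name in NODE_FAILURE_FIELDS
    }
```

Returning `None` for a missing field keeps "the server does not report this counter" distinct from "no nodes have failed".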

@hpatro hpatro changed the title from "Add cluster info metrics for node pfail and fail count" to "Add node pfail and fail count to cluster info metrics" Apr 3, 2025
Contributor

@zuiderkwast zuiderkwast left a comment

Makes sense.

So the use case is to be able to write tests more reliably, or is there a "real" use case?

@hpatro
Collaborator Author

hpatro commented Apr 3, 2025

> Makes sense.
>
> So the use case is to be able to write tests more reliably, or is there a "real" use case?

I'm trying to observe the time it takes to first detect a node failure and the time it takes to mark the node as completely failed. Without this data, it is difficult to modify the algorithm and observe the change in behavior.
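
One way to derive these two durations is to poll CLUSTER INFO from a healthy node and note when each counter first becomes non-zero. The sketch below is in Python and works on samples that have already been collected; the function name `detection_latencies`, the `(timestamp, fields)` sample shape, and the `injected_at` parameter are assumptions made for illustration, not anything defined by this PR.

```python
from typing import Mapping, Optional, Sequence, Tuple

# One sample: (time the reply was taken, parsed CLUSTER INFO fields).
Sample = Tuple[float, Mapping[str, str]]


def detection_latencies(
    samples: Sequence[Sample], injected_at: float
) -> Tuple[Optional[float], Optional[float]]:
    """Return (seconds until first suspicion, seconds until first FAIL).

    Either value is None if that state was never observed in the samples.
    """
    first_suspected: Optional[float] = None
    first_failed: Optional[float] = None
    for taken_at, info in samples:
        pfail = int(info.get("cluster_nodes_pfail", 0))
        fail = int(info.get("cluster_nodes_fail", 0))
        # A node may move from pfail to fail between two polls, so a non-zero
        # fail count also counts as the failure having been detected.
        if first_suspected is None and (pfail > 0 or fail > 0):
            first_suspected = taken_at - injected_at
        if first_failed is None and fail > 0:
            first_failed = taken_at - injected_at
    return first_suspected, first_failed
```

The resolution of both values is bounded by the polling interval, so the interval should be small relative to the configured node timeout.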

@zuiderkwast
Contributor

Observability of failure detection. It's a great concept! ;) Yeah it can be useful for users too, not only for us.

Signed-off-by: Harkrishn Patro <[email protected]>
@hpatro
Collaborator Author

hpatro commented Apr 3, 2025

Signed-off-by: Harkrishn Patro <[email protected]>
@codecov

codecov bot commented Apr 3, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 70.99%. Comparing base (f1d8d77) to head (bf9ddf0).
Report is 22 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #1910      +/-   ##
============================================
- Coverage     71.03%   70.99%   -0.05%     
============================================
  Files           123      123              
  Lines         65682    65721      +39     
============================================
- Hits          46660    46656       -4     
- Misses        19022    19065      +43     
| Files with missing lines | Coverage Δ |
|---|---|
| src/cluster_legacy.c | 86.11% <100.00%> (+0.05%) ⬆️ |

... and 21 files with indirect coverage changes

Contributor

@zuiderkwast zuiderkwast left a comment

LGTM.

@valkey-io/core-team Please ack ( 👍 ) two new fields in CLUSTER INFO.

@zuiderkwast zuiderkwast added the major-decision-pending Major decision pending by TSC team label Apr 3, 2025
@hpatro hpatro added the cluster label Apr 7, 2025
Member

@madolson madolson left a comment

Do we think having this information is better than just asking end users to run cluster nodes/shards and count the number of failed/pfail nodes? I'm a little worried about end users alarming on this metric, even though it includes nodes that aren't part of quorum and aren't serving any traffic.

@hpatro
Collaborator Author

hpatro commented Apr 8, 2025

> Do we think having this information is better than just asking end users to run cluster nodes/shards and count the number of failed/pfail nodes? I'm a little worried about end users alarming on this metric, even though it includes nodes that aren't part of quorum and aren't serving any traffic.

With a large cluster, I would prefer not to pull the cluster nodes/shards output and compute this on the client side.
Thinking more about this, I would also need the same stats for voting members. Maybe users can alarm on those. 😉
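
For comparison, the client-side alternative means parsing one line per known node. Below is a minimal sketch in Python, assuming the documented CLUSTER NODES line format in which the third space-separated field holds comma-separated flags, with `fail?` marking pfail and `fail` marking fail. The name `count_failed_nodes` is illustrative. Voting nodes are deliberately not counted here, because that would require an assumption about how the server defines a voting node.

```python
from typing import Dict


def count_failed_nodes(cluster_nodes_reply: str) -> Dict[str, int]:
    """Count nodes flagged `fail?` (pfail) and `fail` in a CLUSTER NODES reply."""
    counts = {"pfail": 0, "fail": 0}
    for line in cluster_nodes_reply.splitlines():
        parts = line.split()
        # Expected layout: <id> <address> <flags> ...; skip anything shorter.
        if len(parts) < 3:
            continue
        flags = parts[2].split(",")
        if "fail" in flags:
            counts["fail"] += 1
        elif "fail?" in flags:
            counts["pfail"] += 1
    return counts
```

The work here grows with the size of the reply, which is one line per node, whereas the counters in CLUSTER INFO are a fixed handful of fields.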

Signed-off-by: Harkrishn Patro <[email protected]>
@hpatro
Collaborator Author

hpatro commented Apr 8, 2025

I've added voting nodes pfail/fail as well. If we decouple voting nodes from data-serving nodes (primaries) within the same architecture in the future, we will have to add two additional metrics (primary_fail / primary_pfail).

@madolson let me know your thoughts.

Signed-off-by: Harkrishn Patro <[email protected]>
@madolson madolson added the needs-doc-pr and major-decision-approved labels and removed the major-decision-pending label Apr 14, 2025
@madolson madolson removed the needs-doc-pr label Apr 15, 2025
@madolson madolson merged commit 30dc9a7 into valkey-io:unstable Apr 15, 2025
51 checks passed
@madolson madolson added the release-notes This issue should get a line item in the release notes label Apr 15, 2025
madolson pushed a commit to valkey-io/valkey-doc that referenced this pull request Apr 15, 2025
nitaicaro pushed a commit to nitaicaro/valkey that referenced this pull request Apr 22, 2025
Signed-off-by: Harkrishn Patro <[email protected]>
Signed-off-by: Nitai Caro <[email protected]>
nitaicaro pushed a commit to nitaicaro/valkey that referenced this pull request Apr 22, 2025
Signed-off-by: Harkrishn Patro <[email protected]>
nitaicaro pushed a commit to nitaicaro/valkey that referenced this pull request Apr 22, 2025
Signed-off-by: Harkrishn Patro <[email protected]>
hwware pushed a commit to wuranxx/valkey that referenced this pull request Apr 24, 2025
Signed-off-by: Harkrishn Patro <[email protected]>
Signed-off-by: hwware <[email protected]>
@hpatro hpatro mentioned this pull request Jun 27, 2025
15 tasks
sarthakaggarwal97 pushed a commit to sarthakaggarwal97/valkey that referenced this pull request Sep 16, 2025
Signed-off-by: Harkrishn Patro <[email protected]>